release: SKaiNET-transformers 0.30.0 #177
Merged
FunctionGemma-270M ships as Q5_K_M, but GemmaMemSegConverter dequantized
Q5_K weights to FP32 on load ("no native matmul kernel yet for Q5_K"),
losing the memory savings and the in-kernel dequant. Upstream SKaiNET
0.29.1 now provides a first-class Q5_K packed matmul (Q5_KBlockTensorData
+ Q5KMatmulKernel: scalar/Panama/native), so keep Q5_K packed here too:
relayout GGUF bytes to block-major + wrap as Q5_KBlockTensorData (176 B/
block). Dispatch + lazy transpose reach it via DefaultCpuOps.
- Bump skainet 0.28.1 -> 0.29.1 (source-of-truth for the llm-bom platform).
- settings.gradle.kts: mavenLocal first so a locally-published SKaiNET
0.29.1 (carrying the in-progress Q5_K kernel) shadows Maven Central until
it's released; Central remains the fallback.
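The mavenLocal-first shim works because Gradle consults repositories in declaration order and stops at the first one that contains the requested module. A minimal sketch of that ordering, assuming the repositories are declared under `dependencyResolutionManagement` (the actual `settings.gradle.kts` in this repo may be structured differently):

```kotlin
// settings.gradle.kts (illustrative sketch, not the repo's actual file)
dependencyResolutionManagement {
    repositories {
        // Checked first: a locally published SKaiNET build wins if it is present.
        mavenLocal()
        // Fallback: used for every module that mavenLocal does not contain.
        mavenCentral()
    }
}
```

Swapping the two lines, or dropping `mavenLocal()`, restores Central as the only source once the engine release is published.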
Verified (GemmaQ5KPackedParityTest, -PincludeIntegration): the Q5_K packed
path decodes FunctionGemma byte-identically to the FP32 baseline —
[262146, 236769, 3255, 718, 498, 1373, 262152, 106] -> `<tool_0>(state="on")
<end>` for "Turn the light on." (the known-good tool call), 0.81 tok/s on
the JVM host incl. prefill.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
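The 176 B/block figure follows from the Q5_K super-block layout defined by ggml (`block_q5_K`): 256 weights share two fp16 values, 12 bytes of packed sub-block scales, one high bit per weight, and four low bits per weight. The sketch below reproduces that arithmetic and compares packed size with FP32; the helper names and the tensor shape in `main` are hypothetical and not taken from this PR.

```kotlin
// Q5_K super-block geometry, following ggml's block_q5_K definition.
const val QK_K = 256 // weights per super-block

const val Q5_K_BLOCK_BYTES =
    2 +        // d: fp16 super-block scale
    2 +        // dmin: fp16 super-block minimum
    12 +       // scales: packed 6-bit sub-block scales and minimums
    QK_K / 8 + // qh: the fifth (high) bit of every weight
    QK_K / 2   // qs: the low four bits of every weight

// Bytes needed to keep `elementCount` weights packed as Q5_K (hypothetical helper).
fun q5kPackedBytes(elementCount: Long): Long {
    require(elementCount % QK_K == 0L) { "Q5_K tensors must be a multiple of $QK_K elements" }
    return elementCount / QK_K * Q5_K_BLOCK_BYTES
}

// Bytes needed once the same weights are dequantized to FP32.
fun fp32Bytes(elementCount: Long): Long = elementCount * 4L

fun main() {
    val elements = 640L * 2048L // hypothetical weight-matrix shape
    println("packed: ${q5kPackedBytes(elements)} B, fp32: ${fp32Bytes(elements)} B")
}
```

At 176 bytes per 256 weights the packed form costs 5.5 bits per weight, against 32 bits per weight after dequantizing to FP32.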
…ard path
The board binary is Kotlin/Native, but GemmaMemSegConverter (the NATIVE_OPTIMIZED packed-weight path) is jvmMain-only (java.lang.foreign). Move the reusable, platform-neutral pieces to commonMain so K/N can keep K-quant weights packed:
- GemmaQuantLayout.kt (commonMain): logicalShapeFor + relayoutKSeriesRowMajorToBlockMajor (now copyInto, KMP-safe) + packGemmaKQuant<T>(), which builds heap-packed Q4_K/Q5_K/Q6_KBlockTensorData directly (no MemSeg/Arena).
- GemmaMemSegConverter (jvmMain) now shares those commonMain helpers (dup removed); MemSeg/FFM conversion + FP32 fallbacks stay JVM-only.
- commonTest GemmaQuantLayoutTest: block-transpose relayout + packing, runs on every target.
Verified: gemma compiles for JVM + linuxX64; layout tests green (3).
Next (board integration): a commonMain convertGemmaWeightsPacked wired into the K/N load path (byte extraction differs JVM IntArrayTensorData vs native Byte-backed), then a full K/N decode on the SL2610.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
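A rough sketch of what a row-major to block-major relayout can look like, using only `ByteArray.copyInto` so it compiles on every Kotlin Multiplatform target. It assumes "block-major" means transposing the grid of quant blocks, from `[row][block]` to `[block][row]`; the function name, parameters, and that interpretation are illustrative assumptions, not the actual helper in `GemmaQuantLayout.kt`.

```kotlin
// Hypothetical relayout: moves whole quant blocks from [row][block] order to [block][row] order.
// The bytes inside each block are copied verbatim; only the block order changes.
fun relayoutRowMajorToBlockMajor(
    src: ByteArray,
    rows: Int,
    blocksPerRow: Int,
    blockBytes: Int,
): ByteArray {
    require(src.size == rows * blocksPerRow * blockBytes) { "source size does not match the block grid" }
    val dst = ByteArray(src.size)
    for (row in 0 until rows) {
        for (blk in 0 until blocksPerRow) {
            val srcOffset = (row * blocksPerRow + blk) * blockBytes
            val dstOffset = (blk * rows + row) * blockBytes
            // copyInto is part of the common stdlib, so no JVM-only System.arraycopy is needed.
            src.copyInto(dst, destinationOffset = dstOffset, startIndex = srcOffset, endIndex = srcOffset + blockBytes)
        }
    }
    return dst
}

fun main() {
    // Tiny grid: 2 rows x 3 blocks, 2 bytes per block (real Q5_K blocks are 176 bytes).
    val src = ByteArray(12) { it.toByte() }
    println(relayoutRowMajorToBlockMajor(src, rows = 2, blocksPerRow = 3, blockBytes = 2).toList())
}
```

Because every block moves as an opaque byte run, the same routine would serve Q4_K, Q5_K, and Q6_K by changing only `blockBytes`.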
…oad()
NATIVE_OPTIMIZED loads produce raw-byte quant tensors the network mapper can't consume; on JVM an external convertGemmaWeightsToMemSeg (FFM) handled that, but the Kotlin/Native board has no such path. Add a commonMain converter and make load() apply it, so load(NATIVE_OPTIMIZED) yields a runnable network on the board AND the JVM (previously it couldn't be built from raw-byte weights at all).
- GemmaPackedWeights.kt (commonMain): convertGemmaWeightsPacked packs Q4/5/6_K matmul weights to heap Q*_KBlockTensorData (packGemmaKQuant), dequants token_embd/output to FP32 (gathered, no transpose) and other quant types to FP32 [out,in]. No java.lang.foreign. Plus extractRawBytes, which reads the loader's bytes back across both backings (JVM IntArrayTensorData / native Byte-typed).
- GemmaNetworkLoader.load(): for NATIVE_OPTIMIZED, run convertGemmaWeightsPacked before applyWeightsToNetwork.
Verified on JVM AND linuxX64 (GemmaQuantLayoutTest, 4 tests each): relayout, packing, and the byte-extraction round-trip, so native byte extraction is executed, not just compiled.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Extends GemmaQ5KPackedParityTest to also decode via GemmaNetworkLoader.load(NATIVE_OPTIMIZED), the wired commonMain convertGemmaWeightsPacked (board) path with no MemSeg/Arena. All three paths (FP32 baseline, jvmMain MemSeg-packed, load() packed) produce the identical token sequence -> `<tool_0>(state="on")<end>` for "Turn the light on."
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
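The parity claim amounts to comparing each path's decoded token IDs against the FP32 baseline and reporting the first position where they differ. The helper below is a hypothetical illustration of that comparison, not code from GemmaQ5KPackedParityTest; the token IDs are the ones quoted in the first commit message of this PR.

```kotlin
// Returns the index of the first differing token, or null when the sequences are identical.
// A length mismatch counts as a divergence at the end of the shorter sequence.
fun firstDivergence(baseline: List<Int>, candidate: List<Int>): Int? {
    val common = minOf(baseline.size, candidate.size)
    for (i in 0 until common) {
        if (baseline[i] != candidate[i]) return i
    }
    return if (baseline.size == candidate.size) null else common
}

fun main() {
    // Token sequence reported for "Turn the light on." in this PR.
    val baseline = listOf(262146, 236769, 3255, 718, 498, 1373, 262152, 106)
    // Hypothetical stand-ins for the outputs of the two packed decode paths.
    val candidates = mapOf(
        "memseg-packed" to baseline.toList(),
        "load-packed" to baseline.toList(),
    )
    for ((name, tokens) in candidates) {
        val index = firstDivergence(baseline, tokens)
        check(index == null) { "$name diverges from the FP32 baseline at token $index" }
    }
    println("all paths match the baseline")
}
```

Reporting the divergence index rather than a bare boolean makes a failing run point at the first token where a packed kernel drifts from the baseline.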
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Six real-model integration tests (RealGemmaLoad/Eager/BakeIrpa/ExternalParam/DequantDump + GemmaBehavioralAb) pointed at an old workspace path (/home/miso/projects/coral/sl2610-voice-cc-kt/models/...) and failed with "File not found" under -PincludeIntegration. Repoint them to the actual model location (SKaiNET-embedded/sl2610-function-calling/models/), matching GemmaQ5KPackedParityTest.
Verified: all 6 pass against skainet 0.30.0 (mavenLocal), -PincludeIntegration.
Version-aligned with the released SKaiNET 0.30.0 (Q5_K packed matmul, NEON native kernels, Kotlin/Native cinterop), already pinned in the catalog.
- gradle.properties: VERSION_NAME 0.28.1 -> 0.30.0.
- settings.gradle.kts: revert the mavenLocal()-first dev shim (0.30.0 is on Maven Central; the -PuseLocalSkainet composite build stays for local work).
- CHANGELOG.md: add the [0.30.0] entry (Q5_K packed eager runtime, K/N-ready NATIVE_OPTIMIZED Gemma path, kernel-less/Q4_1 dequant fixes) + tag link.
- README.md: bump "Current release" + BOM snippet to 0.30.0; add "What's new in 0.30.0".
- docs tutorials: bump BOM coordinates 0.28.1 -> 0.30.0.
No merge, no tag.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`./gradlew build` runs `jvmApiCheck`, which flagged the committed `.api` dumps as stale. Regenerated via `./gradlew apiDump`; all changes reflect public API already present in the source on this branch:
- llm-agent: the 0.23.3 prefill-progress callback. `generateUntilStop` gained its `onPrefill` `Function2` param and `AgentListener` gained `onPrefillProgress(Int, Int)`; the dump was never refreshed.
- llm-inference/gemma: `convertGemmaWeightsPacked`, the commonMain packed-weight converter added for the Kotlin/Native NATIVE_OPTIMIZED path.
- llm-core: trailing `KClass` dtype param on the vendored transformer modules (AttentionImpl / RMSNormalization / GeGLUFFN / MultiHeadAttention / LayerScalarMul / VoidDense) from earlier engine-aligned work.
`./gradlew build` now green end-to-end (3m 3s, no failed tasks).
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…st blocks
The real-model FunctionGemma-270M integration tests (-PincludeIntegration)
OOM'd with `Java heap space` at the previous 8g default once the model file
is present: GemmaQ5KPackedParityTest holds the FP32 baseline plus both packed
decode networks at once, and the bake-to-irpa test holds weights + serialized
bytes simultaneously.
- Bump the `gemmaTestMaxHeap` default 8g -> 12g.
- Merge the two overlapping `tasks.withType<Test>().configureEach { }` blocks
into one — the second silently overrode the first's maxHeapSize (so jvmArgs
ran with 6g declared but 8g effective). Now jvmArgs, heap, and the seqLen
system property live in a single block.
CI is unaffected: without the model file the integration tests self-skip and
never allocate the headroom. Verified: `:llm-inference:gemma:jvmTest
-PincludeIntegration` green with no -P override (87 tests, 6 skipped, 0
failures); GemmaQ5KPackedParityTest runs.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
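The override described above happens because `configureEach` actions run in registration order, so a later assignment to `maxHeapSize` replaces an earlier one. A minimal sketch of a single merged block, assuming `gemmaTestMaxHeap` is read as a Gradle property; the JVM flag and the system-property name and value shown are hypothetical placeholders, not the repo's actual settings:

```kotlin
// build.gradle.kts (illustrative sketch, not the repo's actual build script)

// -PgemmaTestMaxHeap=16g overrides the default; without it the 12g default applies.
val gemmaTestMaxHeap: String = providers.gradleProperty("gemmaTestMaxHeap").getOrElse("12g")

tasks.withType<Test>().configureEach {
    // The only place the heap is set, so no later block can silently override it.
    maxHeapSize = gemmaTestMaxHeap

    // Hypothetical flag: keep -Xmx out of jvmArgs so it cannot disagree with maxHeapSize.
    jvmArgs("--enable-preview")

    // Hypothetical property name and value for the sequence-length setting.
    systemProperty("gemma.seqLen", "128")
}
```

Keeping heap, JVM flags, and system properties in one block leaves a single place to read when a test run behaves differently from what the build script appears to declare.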
Prepares the 0.30.0 release, version-aligned with the released SKaiNET 0.30.0 (Q5_K packed matmul, NEON native kernels, Kotlin/Native cinterop). Skips 0.29.x — tracked internally without a tagged release.
Headline
- `GemmaMemSegConverter` used to dequantize Q5_K weights to FP32 on load; the engine now provides a first-class Q5_K packed matmul (`Q5_KBlockTensorData` + `Q5KMatmulKernel`), so weights stay packed (176 B/block). FunctionGemma-270M (Q5_K_M) decodes byte-identically to the FP32 baseline (`GemmaQ5KPackedParityTest`).
- The `NATIVE_OPTIMIZED` path is Kotlin/Native-ready. The layout + packing helpers (`GemmaQuantLayout.kt`, `GemmaPackedWeights.kt`) moved to `commonMain`, and `GemmaNetworkLoader.load()` now runs `convertGemmaWeightsPacked`, so the board binary keeps K-quant weights packed with no `java.lang.foreign` MemSeg dependency. Verified on JVM and `linuxX64`.
- Kernel-less quant types under `NATIVE_OPTIMIZED` now dequantize to FP32 `[out, in]` instead of crashing on a rank-1 transpose; `DecoderGgufMemSegConverter` dequantizes Q4_1 and every other non-packed quant type (#654).
Release prep in this PR
- `gradle.properties`: `VERSION_NAME` 0.28.1 → 0.30.0 (catalog `skainet` already pinned to 0.30.0).
- `settings.gradle.kts`: reverted the `mavenLocal()`-first dev shim, since 0.30.0 is on Maven Central; the `-PuseLocalSkainet` composite build is unchanged.
- `CHANGELOG.md`: `[0.30.0]` entry + tag link.
- `README.md` + doc tutorials: "Current release" / BOM coordinates → 0.30.0; new "What's new in 0.30.0".
- `.api` dumps regenerated (`./gradlew apiDump`). `jvmApiCheck` had flagged stale dumps; all deltas reflect public API already in the source: the 0.23.3 prefill callback (`llm-agent`), `convertGemmaWeightsPacked` (gemma), and the `KClass` dtype param on the vendored transformer modules (`llm-core`).
Validation
- `./gradlew build`: BUILD SUCCESSFUL in 3m 3s, no failed tasks (compilation, tests, all `apiCheck` variants).
🤖 Generated with Claude Code